Abstract

We investigate online kernel algorithms which simultaneously process multiple classification
tasks while a fixed constraint is imposed on the size of their active sets. We focus
in particular on the design of algorithms that can efficiently deal with problems where the
number of tasks is extremely high and the task data are large scale. Two new projection-based
algorithms are introduced to efficiently tackle those issues while presenting different
trade-offs on how the available memory is managed with respect to the prior information
about the learning tasks. Theoretically sound budget algorithms are devised by coupling
the Randomized Budget Perceptron and the Forgetron algorithms with the multitask kernel.
We show how the two seemingly contrasting properties of learning from multiple tasks and
keeping a constant memory footprint can be balanced, and how the sharing of the available
space among different tasks is automatically taken care of. We propose and discuss new
insights on the multitask kernel. Experiments show that online kernel multitask algorithms
running on a budget can efficiently tackle real world learning problems involving multiple
tasks.



1 Introduction 

In recent years there has been a growing interest in online learning algorithms processing data 
from multiple and related sources. Many interesting multitask problems involve large scale 
data sets or pose memory and real-time restrictions. For example, massive personalized spam 
detectors and ad serving systems need real-time, scalable, and continuously adaptable learning 
methods. Memory-limited, multi-sensor handheld devices deployed in the field are often required
to process and classify readings without relying on a centralized, dedicated mainframe,
and therefore face severe memory restrictions. Online multitask algorithms are also a natural
choice for a growing number of applications that do not necessarily involve online data processing,
but where data sets are so large that computationally expensive batch algorithms cannot
be used (see, for example, [1]).

In this paper we build on known results ([3], [8], [11]) to design online kernel algorithms
that effectively address these problems. First, we cast new light on how the multitask kernel
of [9] acts as a proxy to wire prior information about task relations into learning algorithms.
Second, we build upon the Projectron algorithm [11] to design two new and highly efficient
multitask budget online algorithms that aggressively retain learned information by sharing the
global budget across different tasks. This is achieved using two multitask mediated projection
steps. We also show how existing budget online kernel algorithms can be combined with the
multitask paradigm in order to obtain scalable and accurate solutions that retain strong theoretical
bounds. In doing so we show how the available space is automatically shared by and assigned
to different tasks. Third, we provide an empirical evaluation of the proposed algorithms in a
variety of different experimental setups. We stress that our algorithms are particularly
well suited to problems where both the data per task and the tasks themselves are great
in number. This goal is achieved by combining the mild dependence on the number of tasks
with the enforced budget size.

Recently, several papers have considered large scale learning of multitask problems.
For example, [12] introduces a highly scalable linear algorithm for spam classification based 
on hashing techniques. Our work focuses on kernel algorithms, and therefore it is not directly 
comparable to [TS]. Moreover, by relying on the multitask kernel, the algorithms discussed here 
can be easily applied to model situations where tasks exhibit a specific pattern of relationships. 
A different approach is the one outlined in [12], where the focus is on learning and tracking 
continuously changing relations among tasks. The nature of the problem considered there 
makes their algorithms unsuitable for large scale applications, since the active set can easily
grow unbounded and the dependence on the number of tasks tends to be quadratic. In Section [J] 
we show that under certain assumptions our techniques can effectively deal with shifting tasks. 

2 Preliminaries and notation 

We consider the usual online classification protocol, where learning proceeds in trials, with an
additional complication due to the presence of multiple, possibly related, tasks. At each time
step $t$ an instance vector $x_t \in \mathbb{R}^d$ for a given task $i_t \in \{1, \dots, k\}$, chosen among a fixed set
of $k$ different tasks, is disclosed, and the algorithm outputs a corresponding binary prediction
$\hat{y}_t \in \{-1, 1\}$. The true label $y_t$ is then revealed and the algorithm acts accordingly, choosing
if and how to update its internal state. We follow [3] and define the $t$-th multitask instance as
the pair $[x_t, i_t]$. Similarly, the multitask example at time $t$ is defined to be the pair comprised of
the $t$-th multitask instance together with the label $y_t$. We do not assume any specific generative
model for the sequence of the multitask examples, i.e., no assumptions are made on the instance
vectors $x_1, x_2, \dots$, the task markers $i_1, i_2, \dots$ and the labels $y_1, y_2, \dots$. In this work our main
interest focuses on classification algorithms that are rotationally invariant and allow for the
adoption of the so-called kernel trick. Let
$\mathcal{K} : (\mathbb{R}^d \times \{1, \dots, k\}) \times (\mathbb{R}^d \times \{1, \dots, k\}) \to \mathbb{R}$ be a
symmetric, positive semidefinite kernel operator between multitask instances and denote with
$\mathcal{H}_{\mathcal{K}}$ its associated Reproducing Kernel Hilbert Space (RKHS). Following the online literature,
theoretical results are stated in the form of relative mistake bounds, where the number of
mistakes made by the algorithm is compared against a measure of performance obtained by the
best reference classifier in a given comparison class. We assume the standard hinge loss function
as measure of performance, which is defined, for any $g \in \mathcal{H}_{\mathcal{K}}$, by
$\ell_t(g) = \max\{0, 1 - y_t g(x_t, i_t)\}$.
Since theoretical results are given with respect to arbitrary sequences, we also introduce the
cumulative hinge loss $L(g) = \sum_t \ell_t(g)$.

With the intent of properly modeling a number of scenarios that frequently occur
in practice, we also consider the so-called shifting model in which the reference classifier is
allowed to change, or shift, throughout the ongoing learning phase. Under this assumption the
single reference classifier $g$ is replaced by the sequence $g_1, g_2, \dots$ and the cumulative hinge loss
becomes $L(\{g_t\}_{t=1,2,\dots}) = \sum_t \ell_t(g_t)$. We omit the argument of the cumulative loss whenever no
ambiguity arises. Unsurprisingly, the presence of shifting reference classifiers will be reflected
in theoretical bounds through a term that takes into account the overall amount of shifting
that the reference sequence undergoes. This term, known as the total shift and denoted with
$S_A(\{g_t\})$, or $S_A$ whenever $\{g_t\}$ is understood from the context, is defined as a sum of the
distances computed with respect to a given positive definite operator $A$ between consecutive
classifiers in the sequence, or, formally, as $S_A(\{g_t\}) = \sum_t \|A^{1/2}(g_t - g_{t-1})\|$.

As previously mentioned, this paper concentrates on kernel-based online linear algorithms.
The prediction function $f : \mathbb{R}^d \times \{1, \dots, k\} \to \mathbb{R}$ of such algorithms can be encoded in
the so-called dual form $f(\cdot) = \sum_j \beta_j \mathcal{K}([x_j, i_j], \cdot)$, that is, as a linear combination of terms
$\beta_j \mathcal{K}([x_j, i_j], \cdot)$ where the weights $\beta_j$ are real coefficients. We refer to the (multi)set of
those instances that appear in the linear combination for $f$ as the active set $S$ and to
the instances themselves as active instances. For convenience we also adopt the shorthand
$J = \{j : [x_j, i_j] \in S\}$. A common drawback of the dual form expansion is the tendency of the
active set to grow unbounded. Online algorithms whose dual form is limited
to $B$ terms are known as budget algorithms. The budget requirement effectively imposes a
memory constraint and forces the online algorithm to throw away an active example whenever
a new one has to be added to a full active set. In the rest of the paper we use the expressions
"budget" and "active set" interchangeably when we refer to budget algorithms.



3 From single to multiple tasks 

We first provide a brief description of the multitask kernel as introduced in [9] and the rationale
behind it. Denote with $\mathcal{K}' : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ the kernel operator between single task instances and
with $\mathcal{H}_{\mathcal{K}'}$ its associated RKHS. In order to streamline the presentation as much as possible we
assume $\|x_t\| = \sqrt{\mathcal{K}'(x_t, x_t)} = 1$ for all $t$.

For our purposes the multitask kernel can be seen as a meta-kernel in that it is responsible
for properly balancing the impact that a given instance $x_t$ has on the learning of task $i_t$ as well
as of the possibly related, remaining tasks $1, \dots, i_t - 1, i_t + 1, \dots, k$. Ideally, in order to meet
this goal the multitask kernel should be defined according to a priori information encoded in a
graph that establishes mutual relations among tasks. Let $G = (V_G, E_G)$ be such a graph, with $V_G$
representing the tasks and $E_G$ being their relations, and define the associated Laplacian matrix
$L_G$ as

$$(L_G)_{i,j} = \begin{cases} d_i & \text{if } i = j, \\ -1 & \text{if } (i,j) \in E_G, \\ 0 & \text{otherwise,} \end{cases}$$

where $d_i$ denotes the number of tasks $i$ is related to. Let $A_G = I + L_G$ be the so-called
interaction matrix for the graph $G$. The graph induced kernel product is defined by
$\mathcal{K}([x_s, i_s], [x_t, i_t]) = (A_G^{-1})_{i_s, i_t} \mathcal{K}'(x_s, x_t)$. Given a sequence of multitask examples, it easily
follows that $\sqrt{\mathcal{K}([x_s, i_s], [x_t, i_t])} \le \max_i \sqrt{(I + L_G)^{-1}_{i,i}} = c_G$ for all time steps $s$ and $t$ in the
sequence. The magnitude of $c_G$ scales according to the connectivity of the multitask kernel
inducing graph $G$. In particular, $c_G$ ranges from $\sqrt{2/(k+1)}$ if $G$ is complete up to 1 when
$G$ has at least one isolated task. Note that, since the model considered here involves multiple
tasks, it is often more natural to describe any reference classifier as a set of
classifiers, each one representing a classifier for a subsequence $x_{t_1}, x_{t_2}, \dots$ where $t_1, t_2, \dots$ are
such that $i_{t_1} = i_{t_2} = \dots$. For the sake of convenience we adopt an extended, vector-based notation
and denote with $g = [g_1, \dots, g_k]$ the multitask reference classifier made by the single task
classifiers $g_1 \in \mathcal{H}_{\mathcal{K}'}, \dots, g_k \in \mathcal{H}_{\mathcal{K}'}$. Moreover, by denoting with $0(\cdot)$ the null operator in $\mathcal{H}_{\mathcal{K}'}$
and in keeping with the above notation, it turns out that the kernel inducing feature map $\psi_{\mathcal{K}}$
is such that $\psi_{\mathcal{K}}([x_t, i_t]) = \mathcal{K}([x_t, i_t], \cdot) = A_G^{-1} [0(\cdot), \dots, 0(\cdot), \mathcal{K}'(x_t, \cdot), 0(\cdot), \dots, 0(\cdot)]$,
with $\mathcal{K}'(x_t, \cdot)$ in position $i_t$, therefore
mapping multitask instances to the space of vector valued functions endowed with the inner
product $\langle f, g \rangle = \mathrm{TRACE}(A_G K_{f,g})$, where $(K_{f,g})_{i,j} = \langle f_i, g_j \rangle$ for any $f, g \in \mathcal{H}_{\mathcal{K}}$ and $\langle \cdot, \cdot \rangle$ is the standard inner



^The identity matrix ensures the positive definiteness of $A_G$ and allows for an easier treatment. It is however
possible to define $A_G = L_G$ and then use the pseudoinverse $A_G^{+}$ in place of the inverse $A_G^{-1}$.
product for the space $\mathcal{H}_{\mathcal{K}'}$. It is also worthwhile to observe that, when the multitask kernel
is employed, and the foregoing interpretation on the structure of $g \in \mathcal{H}_{\mathcal{K}}$ holds, the hinge loss
incurred at time $t$ by $g$ reduces to the hinge loss incurred on the example $(x_t, y_t)$ by the $i_t$-th
single task reference classifier, i.e., $\ell_t(g) = \max\{0, 1 - y_t g_{i_t}(x_t)\}$. The multitask kernel is therefore
a meta-kernel since it is a scalar multiple of the underlying kernel computed on the single
task instances, disregarding task relations. This latter information, indeed a pair of coordinates
pointing to an entry in the inverse of the interaction matrix $A_G$, is instead taken into account
to determine the actual value of the scaling factor. In particular, the following statement holds
(proof deferred to the Appendix).

Proposition 3.1 Let $G$ be a graph of $k$ nodes and $G'$ be the augmented graph obtained from $G$
by adding a dummy node that is connected to every node in $G$. Then $(A_G^{-1})_{i,j}$ is equal to

$$-\frac{1}{2}(R_{G'})_{i,j} + \frac{1}{2(k+1)}\|R_{G'} e_i\|_1 + \frac{1}{2(k+1)}\|R_{G'} e_j\|_1 - \frac{1}{2(k+1)^2}\|R_{G'}\|_1 + \frac{k+2}{(k+1)^2}$$

where $e_i$ is the $i$-th standard basis vector, $\|\cdot\|_1$ denotes the entrywise 1-norm and $R_{G'}$ is the
resistance matrix whose entries $(R_{G'})_{i,j}$
are the resistance distances between tasks $i$ and $j$ in the augmented graph $G'$.
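
As a sanity check, the identity in Proposition 3.1 can be verified numerically on small graphs. The sketch below is an illustration only, not part of the original analysis: it uses exact rational arithmetic, obtains resistance distances from the Laplacian pseudoinverse through the standard identity $L^{+} = (L + \mathbf{1}\mathbf{1}^\top / n)^{-1} - \mathbf{1}\mathbf{1}^\top / n$ (valid for connected graphs, which $G'$ always is), and the helper names `inverse`, `laplacian`, `resistance_matrix` and `check_proposition` are hypothetical.

```python
from fractions import Fraction


def inverse(m):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(m)
    # Augment m with the identity matrix.
    a = [list(row) + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        # Bring a row with a non-zero pivot into position.
        piv = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]


def laplacian(n, edges):
    """Graph Laplacian of an undirected graph given as a list of (i, j) edges."""
    lap = [[Fraction(0)] * n for _ in range(n)]
    for i, j in edges:
        lap[i][j] -= 1
        lap[j][i] -= 1
        lap[i][i] += 1
        lap[j][j] += 1
    return lap


def resistance_matrix(n, edges):
    """Resistance distances R_ij = L+_ii + L+_jj - 2 L+_ij of a connected graph."""
    lap = laplacian(n, edges)
    shifted = [[lap[i][j] + Fraction(1, n) for j in range(n)] for i in range(n)]
    inv = inverse(shifted)
    pinv = [[inv[i][j] - Fraction(1, n) for j in range(n)] for i in range(n)]
    return [[pinv[i][i] + pinv[j][j] - 2 * pinv[i][j] for j in range(n)]
            for i in range(n)]


def check_proposition(k, edges):
    """Compare (I + L_G)^{-1} with the resistance-based expression, entry by entry."""
    lap = laplacian(k, edges)
    a_inv = inverse([[lap[i][j] + int(i == j) for j in range(k)] for i in range(k)])
    # Augmented graph G': the dummy node has index k and touches every task.
    aug_edges = list(edges) + [(i, k) for i in range(k)]
    r = resistance_matrix(k + 1, aug_edges)
    row = [sum(r[i]) for i in range(k + 1)]   # ||R e_i||_1
    total = sum(row)                          # ||R||_1
    for i in range(k):
        for j in range(k):
            rhs = (-r[i][j] / 2 + (row[i] + row[j]) / (2 * (k + 1))
                   - total / (2 * (k + 1) ** 2) + Fraction(k + 2, (k + 1) ** 2))
            if rhs != a_inv[i][j]:
                return False
    return True


# A path 0-1-2 plus an isolated task 3.
print(check_proposition(4, [(0, 1), (1, 2)]))
```

Because the arithmetic is exact, the comparison uses equality rather than a tolerance.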

The augmented graph is needed due to the fact that $A_G = I + L_G$ and therefore it is not a
Laplacian matrix. Note that $(A_G^{-1})_{i,i}$ is 1 if task $i$ is isolated and gets smaller and smaller as
the number of its related tasks grows. On the other hand, $(A_G^{-1})_{i,j}$, with $i \ne j$, decreases as
the resistance distance between tasks $i$ and $j$ grows, and as the average resistance distances
between each of the two tasks $i$ and $j$ and the rest of the tasks in the graph increase. This
amounts to saying that two identical single task instances from loosely connected tasks may be
considered more "different" than slightly different single task instances from tightly connected
tasks. The magnitude of the entries of $A_G^{-1}$ taken in isolation has more of a balancing role than anything else.
In particular, note that $(A_G^{-1})_{i,j}$ for $i = j$ is always greater than for $i \ne j$, for all tasks $i, j$ belonging
to the same connected component. Indeed, as we expect, $\mathcal{K}([x_t, i], [x_t, i]) > \mathcal{K}([x_t, i], [x_t, j])$ for
any $j \ne i$. Finally, it is worth mentioning that, while some of the more compelling properties
of the multitask kernel arise from its interpretation in terms of the graph $G$, nothing prevents
one from replacing the Laplacian $L_G$ with an arbitrary positive semidefinite matrix.
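
The construction described in this section can be sketched in a few lines. The following is a minimal illustration under stated assumptions: the base kernel is a Gaussian kernel (so that $\mathcal{K}'(x, x) = 1$, as assumed above), tasks are indexed from 0, and the function names `interaction_inverse`, `multitask_kernel` and `gaussian` are hypothetical. It builds $A_G^{-1}$ for a complete graph and evaluates the constant $c_G$.

```python
import math
from fractions import Fraction


def inverse(m):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(m)
    a = [list(row) + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        piv = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]


def interaction_inverse(k, edges):
    """Return A_G^{-1} = (I + L_G)^{-1} for an undirected task graph."""
    a = [[Fraction(int(i == j)) for j in range(k)] for i in range(k)]
    for i, j in edges:
        a[i][i] += 1
        a[j][j] += 1
        a[i][j] -= 1
        a[j][i] -= 1
    return inverse(a)


def gaussian(x, y, gamma=1.0):
    """Single task kernel K'; gamma is an arbitrary illustrative bandwidth."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))


def multitask_kernel(a_inv, base_kernel):
    """K([x_s, i_s], [x_t, i_t]) = (A_G^{-1})_{i_s, i_t} * K'(x_s, x_t)."""
    def kernel(xs, i_s, xt, i_t):
        return float(a_inv[i_s][i_t]) * base_kernel(xs, xt)
    return kernel


k = 4
complete = [(i, j) for i in range(k) for j in range(i + 1, k)]
a_inv = interaction_inverse(k, complete)
c_g = math.sqrt(max(a_inv[i][i] for i in range(k)))
kernel = multitask_kernel(a_inv, gaussian)

print(a_inv[0][0], a_inv[0][1])   # diagonal and off-diagonal entries
print(c_g, math.sqrt(2 / (k + 1)))
```

For the complete graph the diagonal entries equal $2/(k+1)$ and the off-diagonal entries $1/(k+1)$, so the same instance is worth twice as much to its own task as to any other task.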



4 Dealing with multiple tasks with a memory budget constraint 

Motivated by their elegant and general theoretical guarantees and by their ease of implementation,
we now introduce several budget algorithms for the multitask setup.



4.1 The Multitask Budget Projectron 

The first multitask algorithm (Algorithm 1, MTBPRJ) considered here is a modification of the
Projectron algorithm [11] where a hard constraint is imposed on the size of the active set.
Similarly to [14], the general idea of the algorithm is to optimize the allotted space by
capitalizing on the examples already stored in the active set.

^It is enough to observe that the entries on i-th row and the i-th column are zeros. 
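Algorithm 1 MTBPRJ

Require: Graph $G$, Budget size $B > 0$, Projection threshold $\eta > 0$
  $S \leftarrow \emptyset$
  for all $t = 1, 2, \dots$ do
    Get $([x_t, i_t], y_t)$ and let $f_{i_t}(x_t)$ be $\sum_{j \in J} \beta_j (A_G^{-1})_{i_j, i_t} \mathcal{K}'(x_j, x_t)$
    if $y_t f_{i_t}(x_t) \le 0$ then
      if $\|P_J^{\perp} \mathcal{K}([x_t, i_t], \cdot)\| \le \eta$ then
        $\beta_j \leftarrow \beta_j + y_t \alpha_j$, $\forall j \in J$ {see text}
      else
        $\beta_t \leftarrow y_t$
        if $|S| < B$ then
          $S \leftarrow S \cup [x_t, i_t]$
        else
          $r \leftarrow \arg\min_{j \in J} \|\beta_j P^{\perp}_{J \setminus \{j\} \cup \{t\}} \mathcal{K}([x_j, i_j], \cdot)\|$
          $S \leftarrow S \cup [x_t, i_t] \setminus [x_r, i_r]$
          $\beta_j \leftarrow \beta_j + \beta_r \gamma_j$, $\forall j \in J$ {see text}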



Denote with $P_J(\cdot)$ the projection operator on the space spanned by $\{\mathcal{K}([x_j, i_j], \cdot)\}_{j \in J}$ and
with $P_J^{\perp}(\cdot)$ the corresponding orthogonal projection operator. When a mistake occurs at time
step $t$, the expression $\|P_J^{\perp}(\mathcal{K}([x_t, i_t], \cdot))\| = \|P_J(\mathcal{K}([x_t, i_t], \cdot)) - \mathcal{K}([x_t, i_t], \cdot)\|$ is evaluated as
a way to assess how much of $\mathcal{K}([x_t, i_t], \cdot)$ cannot be written as a linear combination of the current
active multitask instances. In particular, if $\|P_J^{\perp}(\mathcal{K}([x_t, i_t], \cdot))\|$ is no larger than the user supplied
threshold (i.e., a portion of size no bigger than $\eta$ will not be preserved after projection), then
the budget is left untouched and the weights $\beta_j$ are updated to reflect that we are actually storing
$P_J(\mathcal{K}([x_t, i_t], \cdot))$ in place of $\mathcal{K}([x_t, i_t], \cdot)$. In fact, $P_J(\mathcal{K}([x_t, i_t], \cdot)) = \sum_{j \in J} \alpha_j \mathcal{K}([x_j, i_j], \cdot)$
where the $\alpha_j$'s are the entries of the vector $H^{-1} [\cdots \mathcal{K}([x_j, i_j], [x_t, i_t]) \cdots]^\top$, $\forall j \in J$, and $H$
denotes the Gram matrix of the current active multitask instances. The projection step depends
on task relations in such a way that the projection test of Algorithm 1 (the condition on line 6) is
unlikely to be true if $i_t$ is loosely connected to the tasks $i_j$, even if $\mathcal{K}'(x_t, \cdot)$ is in the space
spanned by $\{\mathcal{K}'(x_j, \cdot)\}_{j \in J}$. If instead the test fails,
either $|S| < B$ and $\mathcal{K}([x_t, i_t], \cdot)$ is simply loaded into the budget, or $|S| = B$, in which case a
projection-based budget maintenance policy is triggered. As a result, MTBPRJ singles out for eviction
the multitask instance $[x_r, i_r]$ that can be removed and projected back onto the remaining
active instances with little overall damage as measured by
$\|\beta_r P^{\perp}_{J \setminus \{r\} \cup \{t\}} \mathcal{K}([x_r, i_r], \cdot)\|$. Here
$P^{\perp}_{J \setminus \{r\} \cup \{t\}} \mathcal{K}([x_r, i_r], \cdot)$ is the amount of $\mathcal{K}([x_r, i_r], \cdot)$ that is lost after projecting it back. The
weights $\beta_j$ are then updated in much the same way as in the projection update (line 7), with the $\gamma_j$'s being the
coefficients of the expansion of $P_{J \setminus \{r\} \cup \{t\}} \mathcal{K}([x_r, i_r], \cdot)$ as a linear combination of
$\{\mathcal{K}([x_j, i_j], \cdot)\}_{j \in J \setminus \{r\} \cup \{t\}}$.
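
The projection step can be made concrete as follows. This is a minimal sketch rather than the authors' implementation: it assumes the active instances are linearly independent in feature space (so the Gram matrix $H$ is invertible), it uses a linear single task kernel on unit vectors for readability, and the names `solve`, `projection` and `make_kernel` are hypothetical. The residual is computed from $\|P_J^{\perp} \mathcal{K}([x_t, i_t], \cdot)\|^2 = \mathcal{K}([x_t, i_t], [x_t, i_t]) - \sum_{j} \alpha_j \mathcal{K}([x_j, i_j], [x_t, i_t])$.

```python
import math


def solve(h, b):
    """Solve h a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, h[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def projection(kernel, active, new):
    """Return (alpha, residual norm) for projecting K(new, .) onto the active span."""
    gram = [[kernel(u, v) for v in active] for u in active]   # H
    k_vec = [kernel(u, new) for u in active]
    alpha = solve(gram, k_vec)                                 # alpha = H^{-1} k
    sq = kernel(new, new) - sum(a * kv for a, kv in zip(alpha, k_vec))
    return alpha, math.sqrt(max(sq, 0.0))


def make_kernel(a_inv):
    """Multitask kernel on pairs (x, task) with a linear single task kernel."""
    def kernel(u, v):
        (x, i), (y, j) = u, v
        return a_inv[i][j] * sum(p * q for p, q in zip(x, y))
    return kernel


related = make_kernel([[2 / 3, 1 / 3], [1 / 3, 2 / 3]])   # complete graph, k = 2
unrelated = make_kernel([[1.0, 0.0], [0.0, 1.0]])          # graph with no edges

active = [((1.0, 0.0), 0)]      # one stored instance, task 0
new = ((1.0, 0.0), 1)           # the same vector, observed for task 1

alpha_rel, res_rel = projection(related, active, new)
alpha_unr, res_unr = projection(unrelated, active, new)
print(alpha_rel, res_rel)
print(alpha_unr, res_unr)
```

With related tasks half of the new instance is absorbed by the stored one; with unrelated tasks nothing is, and the residual equals the full norm even though the two input vectors coincide.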

Unfortunately, MTBPRJ suffers from a major drawback in that instances from an isolated task
tend to undermine the projection mechanism. In fact, if tasks $i_t$ and $i_j$ are unrelated,
$\mathcal{K}([x_t, i_t], [x_j, i_j]) = 0$, which implies $\|P_J^{\perp} \mathcal{K}([x_t, i_t], \cdot)\| > \eta$ almost surely. This in turn
prevents the efficient storage of $\mathcal{K}([x_t, i_t], \cdot)$ in terms of its projection. For the same reason
active multitask instances from isolated tasks are seldom chosen for eviction, since no part of
them can be retained as a linear combination of the remaining instances. To overcome these
limits we designed a new projection-based budget multitask algorithm (Algorithm 2, MTBPRJ-2).
We denote with $P'_J(\cdot)$ the projection operator on the space spanned by $\{\mathcal{K}'(x_j, \cdot)\}_{j \in J}$
and with $(P')_J^{\perp}(\cdot)$ the corresponding orthogonal projection operator. The algorithm maintains
$k$ sets of weights $(\beta_i)_j$ for $i = 1, \dots, k$ and in doing so it trades off space (used to
store the weights) to increase retention and ultimately accuracy. More specifically, the
projection applied to the instance observed at time $t$ is equivalent to the one employed by
MTBPRJ except that the task markers are now ignored, effectively increasing the chance of
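Algorithm 2 MTBPRJ-2

Require: Graph $G$, Budget size $B > 0$, Projection threshold $\eta > 0$
  $S \leftarrow \emptyset$
  for all $t = 1, 2, \dots$ do
    Get $([x_t, i_t], y_t)$ and let $f_{i_t}(x_t)$ be $\sum_{j \in J} (\beta_{i_t})_j \mathcal{K}'(x_j, x_t)$
    if $y_t f_{i_t}(x_t) \le 0$ then
      if $\|(P')_J^{\perp}(\mathcal{K}'(x_t, \cdot))\| \le \eta$ then
        $(\beta_l)_j \leftarrow (\beta_l)_j + \alpha_j y_t (A_G^{-1})_{l, i_t}$, $\forall j \in J$, $l \in \{1, \dots, k\}$
      else
        if $|S| < B$ then
          $S \leftarrow S \cup [x_t, i_t]$
          $(\beta_l)_t \leftarrow y_t (A_G^{-1})_{l, i_t}$, $\forall l \in \{1, \dots, k\}$
        else
          $r \leftarrow \arg\min_{j \in J} \|d_j\|$ {see text}
          $S \leftarrow S \cup [x_t, i_t] \setminus [x_r, i_r]$
          $(\beta_l)_j \leftarrow (\beta_l)_j + \gamma_j (\beta_l)_r (A_G^{-1})_{l, i_t}$, $\forall j \in J$, $l \in \{1, \dots, k\}$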



the instance being efficiently stored through its projection. The projection-based maintenance
policy now prescribes removing the active instance $\mathcal{K}'(x_j, \cdot)$ for which the $\ell_2$-norm of the vector
$d_j$ is as small as possible.

As one can easily observe, each of the $k$ entries of this vector gauges to what extent the removal
of $\mathcal{K}'(x_j, \cdot)$ affects one of the $k$ different prediction functions that the algorithm maintains.



4.2 The Multitask Randomized Budget Perceptron 

The next algorithm we consider (Algorithm 3, MTRBP) is the multitask version of the Randomized
Budget Perceptron [3] algorithm.



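Algorithm 3 MTRBP

Require: Graph $G$, Budget size $B > 0$
  $S \leftarrow \emptyset$
  for all $t = 1, 2, \dots$ do
    Get $([x_t, i_t], y_t)$ and let $f_{i_t}(x_t)$ be $\sum_{j \in J} \beta_j (A_G^{-1})_{i_j, i_t} \mathcal{K}'(x_j, x_t)$
    if $y_t f_{i_t}(x_t) \le 0$ then
      $\beta_t \leftarrow y_t$
      if $|S| < B$ then
        $S \leftarrow S \cup [x_t, i_t]$
      else
        $S \leftarrow S \cup [x_t, i_t] \setminus \mathrm{rnd}(S)$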



MTRBP resembles the Perceptron algorithm and relies on a very simple scheme to deal with
the memory budget constraint. If a mistake occurs at time step $t$, the instance $[x_t, i_t]$ is
added to the budget with its weight set to $y_t$, and a random space management policy
is triggered whenever the budget size would grow beyond $B$. In the pseudocode of Algorithm 3 the
primitive $\mathrm{rnd}(\cdot)$ samples a random element from its set argument. The following theorem is a
fairly straightforward multitask version of [3, Theorem 5]. Since the algorithm is randomized,
the theoretical guarantee provided in Theorem 1 bounds the expected value of the (random)
number of mistakes, rather than the number of mistakes itself.
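
Algorithm 3 translates almost line by line into code. The sketch below is one possible reading, offered as an illustration: it assumes that the evicted element is sampled uniformly from the active set as it stands before the new instance is inserted, it takes $A_G^{-1}$ as a precomputed table, and the class name `MTRBP`, its attributes and the seeded generator are hypothetical choices.

```python
import random


class MTRBP:
    """Multitask Randomized Budget Perceptron (illustrative sketch)."""

    def __init__(self, a_inv, base_kernel, budget, seed=0):
        self.a_inv = a_inv              # table holding the entries of A_G^{-1}
        self.base_kernel = base_kernel  # single task kernel K'
        self.budget = budget            # B
        self.rng = random.Random(seed)
        self.active = []                # entries are (x_j, i_j, beta_j)
        self.mistakes = 0

    def score(self, x, task):
        """f_task(x) = sum_j beta_j (A_G^{-1})_{i_j, task} K'(x_j, x)."""
        return sum(beta * self.a_inv[j][task] * self.base_kernel(xj, x)
                   for xj, j, beta in self.active)

    def step(self, x, task, y):
        """Process one multitask example; update only on a mistake."""
        if y * self.score(x, task) <= 0:
            self.mistakes += 1
            if len(self.active) >= self.budget:
                # Budget full: drop a uniformly random active instance.
                self.active.pop(self.rng.randrange(len(self.active)))
            self.active.append((x, task, y))


def dot(x, y):
    return sum(p * q for p, q in zip(x, y))


# Two related tasks (complete graph, k = 2) and a budget of two instances.
learner = MTRBP([[2 / 3, 1 / 3], [1 / 3, 2 / 3]], dot, budget=2)
stream = [((1.0, 0.0), 0, 1), ((1.0, 0.0), 1, 1),
          ((1.0, 0.0), 0, -1), ((0.0, 1.0), 0, 1)]
for x, task, y in stream:
    learner.step(x, task, y)
print(learner.mistakes, len(learner.active))
```

The second example is classified correctly for task 1 using only what was stored for task 0, which is the sharing effect of the multitask kernel; the last example triggers the random eviction.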

Theorem 1 The expected value of the number of mistakes $M$ made by the MTRBP algorithm,
run with a graph $G$ and with a budget size $B > 0$, on any finite sequence of $n$ multitask examples
$([x_1, i_1], y_1), \dots, ([x_n, i_n], y_n)$, is bounded above by

[bound (1): the displayed expression did not survive in the source text]

for any $\epsilon \in (0, 1)$ and any sequence of multitask reference classifiers $g_0, \dots, g_{n-1} \in \mathcal{H}_{\mathcal{K}}$ such
that $\max_t \sqrt{\mathrm{TRACE}(K_{g_t, g_t} A_G)} \le \frac{\epsilon}{c_G} \sqrt{B}$, where $K_{g_t, g_t}$ is the Gram matrix computed among the
single task classifiers $(g_t)_1, \dots, (g_t)_k$.

The variable $\epsilon$ in Theorem 1 trades off the size of the comparison class for the tightness of the
bound: the larger the former (which is the case when $\epsilon$ leans towards 1), the looser the latter.
Since Theorem 1 holds for arbitrary choices of $\epsilon$, MTRBP effectively competes against the best
multitask classifier in the best traded off comparison class. Let us observe how the multitask
kernel affects bound (1). The role of the kernel appears evident in the shifting term, both
explicitly through the presence of the interaction matrix $A_G$ and implicitly through the weighting
constant factor $c_G$. Consider the expansion of $S_{A_G}$ as $\sum_t \sqrt{\langle (g_t - g_{t-1}) A_G, (g_t - g_{t-1}) \rangle}$.
Each term under the square root has the form

$$\sum_{i=1}^{k} \|g_{t,i} - g_{t-1,i}\|^2 + \sum_{(i,j) \in E} \|(g_{t,i} - g_{t,j}) - (g_{t-1,i} - g_{t-1,j})\|^2 \qquad (2)$$

where the first summand summarizes how much each of the $k$ reference classifiers shifts from
time step $t - 1$ to time step $t$ and therefore does not take into account the relations among tasks.
The second summand of (2) is where those relations come into play. Specifically, for each pair of
related tasks the difference of their relative positions after consecutive time steps is evaluated.
This expression is clearly small when the reference classifiers of related tasks shift in a similar
way^. In particular, when this shifting pattern holds and $G$ is a complete graph the shifting
term $c_G S_{A_G}$ becomes, excluding constant factors, $\sum_t \|g_{t,i} - g_{t-1,i}\|$ and it is therefore
similar to the one we would get if there were only one task. Aside from the shifting term, the
impact of the multitask kernel is then largely confined to the comparison class inequality that
binds $B$ to $\mathrm{TRACE}(K_{g_t, g_t} A_G)$. In fact, setting a given value of $B$ amounts to defining the shape
of the class of multitask reference classifiers the algorithm competes against. After rewriting
$\mathrm{TRACE}(K_{g_t, g_t} A_G)$ as $\sum_{i=1}^{k} \|g_{t,i}\|^2 + \sum_{(i,j) \in E} \|g_{t,i} - g_{t,j}\|^2$ it is easy to see that a given choice of
$B$ imposes a constraint on the norms of the task-specific reference classifiers and on the spatial
relations they entertain with each other. To better illustrate how the memory constraint works
in this respect, consider the following two opposite situations where we again set $G$ to be the
complete graph. First assume that the worst-case multitask reference classifiers are stationary,
i.e., $g_t = g$ for all $t$, and their single task classifiers are overlapped, i.e., $g_1 = g_2 = \dots = g_k$.
In this case $\|g_i\|^2 \le \frac{k+1}{k} B$ for all $i = 1, \dots, k$, excluding constant factors. In other words, the
algorithm can compete against longer reference classifiers whose squared norm is nearly $B$ instead of
$B/k$ as a naive, non multitask approach would imply. Implementation-wise, this means that
the whole allotted space can be devoted to learning a single, more complex task rather
than being inefficiently fragmented to track $k$ equal reference classifiers. On the other hand, if the $k$
single task reference classifiers are distant from each other^, as measured by the metric defined
by $A_G^{1/2}$, then the average norm of the single task reference vectors the algorithm can compete
against is reduced by an amount proportional to how much they are spread apart. In this
case, since the tasks we are learning are different from each other, MTRBP is forced to reserve a
portion of the available space to each task.
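
The two opposite situations can be compared numerically through the decomposition of $\mathrm{TRACE}(K_{g,g} A_G)$ given above. The sketch below is an illustration under the assumption that the single task reference classifiers are finite-dimensional vectors (a linear kernel); the function name `trace_ka` and the example vectors are hypothetical.

```python
def squared_norm(v):
    return sum(c * c for c in v)


def trace_ka(gs, edges):
    """TRACE(K_{g,g} A_G) = sum_i ||g_i||^2 + sum_{(i,j) in E} ||g_i - g_j||^2."""
    total = sum(squared_norm(g) for g in gs)
    for i, j in edges:
        total += squared_norm([a - b for a, b in zip(gs[i], gs[j])])
    return total


k = 3
complete = [(i, j) for i in range(k) for j in range(i + 1, k)]

# Overlapped classifiers: all tasks share the same unit-norm reference vector.
overlapped = [(1.0, 0.0, 0.0)] * k
# Spread classifiers: unit-norm reference vectors that are mutually orthogonal.
spread = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

print(trace_ka(overlapped, complete))   # 3.0
print(trace_ka(spread, complete))       # 9.0
```

Both configurations use reference vectors of the same norm, yet the spread one consumes three times as much of the quantity constrained by $B$, so a fixed budget supports shorter classifiers when tasks disagree.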

^Think of a multi-language spam classifier. Different trends may arise over time but the language (task)
relations stay the same.

^When the reference classifiers have the same norm and $G$ is a complete graph this amounts to saying that the
$(g_t)_i$'s are the vertices of a $k$-simplex centered at the origin.
4.3 The Multitask Self-Tuned Forgetron

Adapting the Forgetron algorithm of [8] to run within the multitask protocol is relatively
straightforward. Before investigating the details of the algorithm (MTFORG) recall that its
theoretical behavior is sub-optimal, since the size of the comparison class is of the order
$O(\sqrt{B / \log(B)})$, a factor $\sqrt{\log(B)}$ worse than the optimal $O(\sqrt{B})$ achieved by MTRBP.
Nonetheless, the algorithm may prove to be more effective in a number of real world scenarios
where the sequence of examples is not adversarial and the policy of always dropping the oldest
instance may turn out to be beneficial.

We consider the self-tuned version of the algorithm, which performs Perceptron-like updates
whenever there is still room in the active set and operates as follows otherwise. If a mistake
occurs at time step $t$ and the current budget size $|S|$ is already $B$, then the oldest active
multitask instance $[x_r, i_r]$ with $r = \min(J)$ is singled out for eviction and the incoming instance
is loaded into the budget with the weight $\beta_t$ set to $y_t$ (removal step). As a way to
control the detrimental effect of the removal of $[x_r, i_r]$ from the budget, MTFORG also reduces
the weights $\beta_j$ by an adaptive factor $\phi$ (shrinking step) so that older instances have smaller
weights. Because it is likely that both steps negatively affect the overall performance of the
algorithm, the rescaling factor $\phi$ is adaptively set to moderate the impact of the shrinking step.
We take advantage of the fact that the norm of multitask instances is bounded by $c_G$, which
may be much less than 1 for several non trivial graphs, and slightly fine tune the algorithm
presented in [8] by setting

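$$\phi = \max\left\{ \phi \in (0, 1] : \Psi_G\big(\beta_r y_r \phi, \beta_r \phi f_{i_r}(x_r)\big) + Q \le \frac{15}{32} M \right\} \qquad (3)$$

where $\Psi_G(\lambda, \mu) = c_G^2 \lambda^2 + 2 c_G \lambda - 2 \lambda \mu$. The resulting algorithm is similar to Algorithm 3 where
line 9 (the random eviction step) is replaced by

      $r \leftarrow \min(J)$
      $S \leftarrow S \cup [x_t, i_t] \setminus [x_r, i_r]$
      $\beta_j \leftarrow \phi \beta_j$, $\forall j \in J$ {$\phi$ computed as in (3)}
      $Q \leftarrow Q + \Psi_G\big(\beta_r y_r \phi, \beta_r \phi f_{i_r}(x_r)\big)$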



The following theorem, which is an easy consequence of [8, Theorem 3], provides a deterministic
worst-case upper bound on the number of mistakes made by MTFORG.

Theorem 2 The number of mistakes $m$ made by the MTFORG algorithm, run with a graph
$G$ and with a budget size $B \ge 83$, on any finite sequence of $n$ multitask examples
$([x_1, i_1], y_1), \dots, ([x_n, i_n], y_n)$ satisfies

$$m \le 4L + \frac{B + 1}{2 \log(B + 1)}$$

for any multitask reference classifier $g$ such that
$\sqrt{\mathrm{TRACE}(K_{g,g} A_G)} \le \frac{1}{4 c_G} \sqrt{\frac{B+1}{\log(B+1)}}$, where $K_{g,g}$
is the Gram matrix computed among the single task classifiers $g_1, \dots, g_k$.

First, observe that the above bound only applies to a stationary comparison multitask classifier.
As far as we know no shifting analysis is known for the Self-Tuned version of the Forgetron
algorithm. Nonetheless, a shifting bound can be obtained for an experimentally less appealing
variant by following the argument and the analysis given in [13]. Second, as for MTRBP, the role
of the multitask kernel is mainly reflected in the bound through the comparison class inequality
$\sqrt{\mathrm{TRACE}(K_{g,g} A_G)} \le \frac{1}{4 c_G} \sqrt{\frac{B+1}{\log(B+1)}}$. In this respect, note that it is the presence of $c_G$ in $\Psi_G$ that



^For details on why $O(\sqrt{B})$ is the largest norm an algorithm with a budget size $B$ can compete against see [8].
Table 1: Online training F-measures achieved by the multitask budget algorithms run with G
set to C (complete graph) or D (totally disconnected graph) and B set to 25%, 10% or 5% of
the size of the active set obtained by a battery of k Perceptrons on the PKDD 2006 spam task
A, School binary, and Sentiment data sets. As a reference, the F-measures achieved by the k
independent Perceptrons are shown in parentheses next to the name of each data set.

Data set (baseline)     Algorithm   C(25%)   D(25%)   C(10%)   D(10%)   C(5%)    D(5%)
Pkdd06 Spam A (92.0%)   MTRBP       84.8%    81.4%    78.0%    73.1%    72.8%    66.7%
                        MTFORG      85.1%    81.7%    77.9%    73.0%    72.3%    66.7%
                        MTBPRJ      91.6%    89.4%    87.8%    84.0%    83.4%    76.0%
                        MTBPRJ-2    91.6%    90.3%    89.3%    87.3%    85.4%    82.3%
School binary (39.1%)   MTRBP       40.4%    35.0%    38.6%    30.9%    37.3%    26.0%
                        MTFORG      39.7%    35.1%    38.0%    31.5%    36.9%    25.9%
                        MTBPRJ      40.6%    37.4%    40.2%    32.6%    39.4%    23.8%
                        MTBPRJ-2    41.2%    39.1%    40.9%    39.0%    39.6%    37.9%
Sentiment (71.5%)       MTRBP       66.8%    63.3%    62.0%    58.3%    59.4%    56.1%
                        MTFORG      66.7%    63.4%    62.3%    58.8%    58.8%    56.5%
                        MTBPRJ      71.4%    67.7%    66.6%    63.6%    64.0%    60.6%
                        MTBPRJ-2    71.7%    68.5%    67.9%    64.9%    64.5%    61.7%


allows the size of the comparison class to scale as a function of the tasks through the constant
factor $1 / (4 c_G)$ (see Subsection 4.2 for a detailed discussion on the comparison class inequality
within the multitask framework). Third, observe that when a mistake occurs and $|S|$ is $B$, then
the removal step only affects those tasks that are related to $i_r$, since $\mathcal{K}([x_r, i_r], [x_j, i_j]) = 0$
if tasks $i_r$ and $i_j$ are unrelated. As a result, tasks belonging to connected components that
seldom need processing can be quickly forgotten altogether. Combining the Forgetron budget
maintenance policy with the multitask kernel thus has the important effect that the allocation
of the available space to the most frequent tasks is automatically taken care of.

4.4 Implementation details 

While the implementation of the budget algorithms discussed in this paper is straightforward,
a few remarks are still useful. Except for MTBPRJ-2, all the algorithms discussed
in this paper only maintain $B$ real weights and $B$ multitask instance vectors regardless of the
number $k$ of the tasks at hand. Of course the matrix $A_G$ employed by the multitask kernel may
still require $O(k^2)$ space. Note, however, that if $G$ is not overly complex and exhibits a certain
regularity, the required space may be much smaller, and it may be more efficient
to opt for a programmatic implementation of $A_G$ over the naive table-based approach. Both
MTRBP and MTFORG require $O(1)$ operations to update their internal state on mistaken rounds,
whereas it is not hard to show that MTBPRJ takes $O(B^2)$ operations. As for MTBPRJ-2 we should
note that the algorithm only needs to store $B$ real weights for each connected component of $G$,
which results in a much milder dependence on $G$ than a naive implementation would imply.
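
As an illustration of the programmatic alternative to a $k \times k$ table, the two graphs used in the experiments admit closed forms for the entries of $A_G^{-1}$: for the complete graph $A_G = (k+1) I - \mathbf{1}\mathbf{1}^\top$, whose inverse is $(I + \mathbf{1}\mathbf{1}^\top)/(k+1)$, and for the totally disconnected graph $A_G = I$. The sketch below is a hedged example; the function names are hypothetical, and the explicit matrix built in `is_inverse_of_interaction` exists only to verify the closed forms.

```python
from fractions import Fraction


def a_inv_complete(k):
    """Entries of (I + L_G)^{-1} for the complete graph on k tasks, O(1) memory."""
    def entry(i, j):
        return Fraction(2 if i == j else 1, k + 1)
    return entry


def a_inv_disconnected():
    """Entries of (I + L_G)^{-1} when G has no edges (A_G is the identity)."""
    def entry(i, j):
        return Fraction(1 if i == j else 0)
    return entry


def is_inverse_of_interaction(entry, k, edges):
    """Check that (I + L_G) times the matrix defined by entry() is the identity."""
    a = [[Fraction(int(i == j)) for j in range(k)] for i in range(k)]
    for i, j in edges:
        a[i][i] += 1
        a[j][j] += 1
        a[i][j] -= 1
        a[j][i] -= 1
    return all(sum(a[i][l] * entry(l, j) for l in range(k)) == int(i == j)
               for i in range(k) for j in range(k))


k = 6
complete = [(i, j) for i in range(k) for j in range(i + 1, k)]
print(is_inverse_of_interaction(a_inv_complete(k), k, complete))
print(is_inverse_of_interaction(a_inv_disconnected(), k, []))
```

A kernel evaluation then calls `entry(i_s, i_t)` instead of indexing a stored matrix, so the memory used for task relations no longer depends on $k$.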

5 Experiments 

The experimental performance of budget multitask algorithms on non-synthetic data sets is of
key importance because the constraint on the size of the budget and the favorable dependence
on the number of tasks make them suited even for large scale applications. In this section
we evaluate the multitask budget algorithms discussed in Section 4 over three data sets, the
PKDD 2006 Spam Task A [5] data set ($k = 3$, $d = 106780$, $n = 7500$), the School [6] data set
($k = 139$, $d = 28$, $n = 15362$) and the Sentiment [7] data set ($k = 4$, $d = 473856$, $n = 8000$).
Since examples in the School data set have real valued labels, preprocessing was required to
turn those into binary values. We therefore assigned a positive label to those instances whose
original score was above the 75-th percentile and a negative label to those instances whose
original score was below that threshold. Moreover we rescaled non binary features to the range
[0, 1]. We used a Gaussian kernel for the School data set and a polynomial kernel for the PKDD
2006 Spam Task A and Sentiment data sets.
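
The preprocessing of the School data can be sketched as below. Several details are assumptions made for illustration, since the text does not specify them: the percentile is computed with the nearest-rank method, scores exactly equal to the threshold receive a negative label, constant feature columns are mapped to 0, and the function names are hypothetical.

```python
import math


def binarize_at_percentile(scores, q=75):
    """Label +1 the scores above the q-th percentile and -1 all the others."""
    ordered = sorted(scores)
    # Nearest-rank percentile (assumed; other percentile definitions exist).
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    threshold = ordered[rank - 1]
    labels = [1 if s > threshold else -1 for s in scores]
    return labels, threshold


def rescale_unit(column):
    """Min-max rescale a non binary feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


scores = [10, 20, 30, 40, 50, 60, 70, 80]
labels, threshold = binarize_at_percentile(scores)
print(threshold, labels)
print(rescale_unit([2, 4, 6]))
```

With this thresholding roughly a quarter of the examples are positive, so the binarized task is imbalanced, which is consistent with the use of the F-measure rather than accuracy in Table 1.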

The yardstick for our algorithms is the online classifier that runs a battery of $k$
Perceptrons in parallel with no constraints on the size of their active sets. We evaluated the
budget multitask algorithms for different sizes of their budget. In particular, we set $B$ to 25%,
10% and 5% of the size of the active set obtained by the baseline algorithm after a single training
epoch. For MTBPRJ and MTBPRJ-2 we set $\eta = 0.01$.

Imposing a budget restriction should be detrimental to the overall performance, and it should
be even more so when $k$ tasks rather than a single one are to be processed. On the other hand, a
proper multitask formulation should lessen, and ideally negate, this impact. The F-measure
values achieved after a single pass over the training set are reported in Table 1. The numbers
show that disregarding multitask information (i.e., $G = D$) while imposing a constraint on the
size of the active set substantially degrades the performance of budget multitask algorithms,
and this behavior is unsurprisingly shared by all algorithms. Even in this case, however, the
projection-based algorithms tend to behave relatively better, confirming that the
projection schemes are a first step towards better retention. Specifically, the global projection
step employed by MTBPRJ-2 turns out to be particularly effective, as evidenced by comparing
the F-measures obtained on the School data set, and to a lesser extent on the PKDD 2006 Spam
Task A data set, when $B = 5\%$.

It is of course more interesting to see how this decrease in the overall performance can be
offset by taking into account multitask relations. In fact, setting $G = C$ results in better
F-measure values for all three data sets and for all algorithms. Moreover the differences in the
F-measures obtained for the different choices of $G$ grow larger as the budget size is shrunk to
smaller values. This should not come as a surprise, since when the available memory is scarce
an efficient management policy, which is the main benefit that the multitask kernel brings to
budget algorithms, becomes crucial. It is particularly surprising that the multitask kernel can
be so effective that for $B = 25\%$ all four algorithms match the performance of the baseline and,
for MTBPRJ and MTBPRJ-2, this holds true even when $B$ goes down to 5%.



6 Appendix 

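Proof of Proposition 3.1: First, observe that the Laplacian matrix for the augmented graph
$G'$ is

$$L_{G'} = \begin{bmatrix} A_G & -\mathbf{1} \\ -\mathbf{1}^\top & k \end{bmatrix}$$

where $\mathbf{1}$ is the vector of all ones. We use [?, Theorem 3.3.2] to compute the pseudoinverse of
$L_{G'}$. Observing that $A_G \mathbf{1} = (I + L_G) \mathbf{1} = \mathbf{1} + L_G \mathbf{1} = \mathbf{1}$,
since $L_G$ is a Laplacian matrix, and hence $A_G^{-1} \mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top A_G^{-1} \mathbf{1} = k$, the pseudoinverse simplifies to

$$L_{G'}^{+} = \begin{bmatrix} A_G^{-1} - \frac{2 + k}{(1 + k)^2} \mathbf{1}\mathbf{1}^\top & -\frac{1}{(1 + k)^2} \mathbf{1} \\ -\frac{1}{(1 + k)^2} \mathbf{1}^\top & \frac{k}{(1 + k)^2} \end{bmatrix}$$

Denoting with $e_i$ the $i$-th standard vector of size $k$, we have, for all $i = 1, \dots, k$ and $j = 1, \dots, k$,

$$(A_G^{-1})_{i,j} = e_i^\top A_G^{-1} e_j = (L_{G'}^{+})_{i,j} + \frac{2 + k}{(1 + k)^2} \qquad (4)$$

Finally, by [10, Theorem 7] it holds that

$$L_{G'}^{+} = -\frac{1}{2} \left[ R_{G'} - \frac{1}{1 + k} \left( R_{G'} \mathbf{1}\mathbf{1}^\top + \mathbf{1}\mathbf{1}^\top R_{G'} \right) + \frac{1}{(1 + k)^2} \mathbf{1}\mathbf{1}^\top R_{G'} \mathbf{1}\mathbf{1}^\top \right] \qquad (5)$$

Substituting (5) back into (4) yields the desired result. □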